Optimal Receding Horizon Control for Finite Deterministic Systems with Temporal Logic Constraints

Maria Svorenova, Ivana Cerna and Calin Belta 






Abstract — In this paper, we develop a provably correct 
optimal control strategy for a finite deterministic transition 
system. By assuming that penalties with known probabilities of 
occurrence and dynamics can be sensed locally at the states of 
the system, we derive a receding horizon strategy that minimizes 
the expected average cumulative penalty incurred between two 
consecutive satisfactions of a desired property. At the same 
time, we guarantee the satisfaction of correctness specifications 
expressed as Linear Temporal Logic formulas. We illustrate the 
approach with a persistent surveillance robotics application. 

I. Introduction 

Temporal logics, such as Computation Tree Logic (CTL) 
and Linear Temporal Logic (LTL), have been customarily 
used to specify the correctness of computer programs and 
digital circuits modeled as finite-state transition systems [1]. 
The problem of analyzing such a model against a temporal 
logic formula, known as formal analysis or model checking, 
has received a lot of attention during the past thirty years, and 
several efficient algorithms and software tools are available 
[2], [3], [4]. The formal synthesis problem, in which the goal
is to design or control a system from a temporal logic
specification, had not been studied extensively until a few years
ago. Recent results include the use of model checking
algorithms for controlling deterministic systems [5], automata
games for controlling non-deterministic systems [6], and linear
programming and value iteration for synthesis of control
policies for Markov decision processes [1], [7]. Through the
use of abstractions, such techniques have also been used for 
infinite systems, such as continuous and discrete-time linear 
systems [8], [9], [10], [11], [12]. 

The connection between optimal and temporal logic con- 
trol is an intriguing problem with a potentially high impact 
in several applications. By combining these two seemingly 
unrelated areas, our goal is to optimize the behavior of 
a system subject to correctness constraints. Consider, for 
example, a mobile robot involved in a persistent surveillance 
mission in a dangerous area and under tight fuel / time 
constraints. The correctness requirement is expressed as a 
temporal logic specification, e.g., "Alternately keep visiting 
A and B while always avoiding C", while the resource 
constraints translate to minimizing a cost function over the 
feasible trajectories of the robot. While optimal control
is a mature discipline and formal synthesis is fairly well
understood, optimal formal synthesis is a largely open area. 

In this paper, we focus on finite labeled transition systems 
and correctness specifications given as formulas of LTL. We 
assume there is a penalty associated with the states of the sys- 
tem with a known occurrence probability and time-behavior. 
Motivated by persistent surveillance robotic missions, our 
goal is to minimize the expected average cumulative penalty 
incurred between two consecutive satisfactions of a desired 
property associated with some states of the system, while 
at the same time satisfying an additional temporal logic 
constraint. Also from robotics comes our assumption that
actual penalty values can only be sensed locally, in close
proximity to the current state, during the execution of the
system. We propose two algorithms for this problem. The
first operates offline, i.e., without executions of the system, 
and therefore only uses the known probabilities but does 
not exploit actual penalties sensed during the execution. 
The second algorithm designs an online strategy by locally 
improving the offline strategy based on local sensing and 
simulation over a user-defined planning horizon. While both
algorithms guarantee optimal expected average penalty
collection, in real executions of the system the second algorithm
provides a lower realized average than the first algorithm. We
illustrate these results on a robotic persistent surveillance 
case study. 

This paper is closely related to [13], [14], [5], which 
also focused on optimal control for finite transition systems
with temporal logic constraints. In [5], the authors developed 
an offline control strategy minimizing the maximum cost 
between two consecutive visits to a given set of states, subject 
to constraints expressed as LTL formulas. Time-varying, 
locally sensed rewards were introduced in [13], where a 
receding horizon control strategy maximizing rewards col- 
lected locally was shown to satisfy an LTL specification. This 
approach was generalized in [14] to allow for a broader class 
of optimization objectives and reward models. In contrast 
with [13], [14], we interpret the dynamic values appearing 
in states of the system as penalties instead of rewards, i.e., 
in our case, the cost function is being minimized rather than 
maximized. That allows the existence of the optimum in 
expected average penalty collection. In this paper, we show
how it can be achieved using an automata-based approach and
game theory results.

In Sec. II, we introduce the notation and definitions
necessary throughout the paper. The problem is stated in
Sec. III. The main results of the paper are in Sec. IV and
Sec. V. The simulation results are presented in Sec. VI.

II. Preliminaries

For a set $S$, we use $S^\omega$ and $S^+$ to denote the set of all
infinite and all non-empty finite sequences of elements of $S$,
respectively. For a finite or infinite sequence $\alpha = a_0 a_1 \ldots$, we
use $\alpha(i) = a_i$ to denote the $i$-th element and $\alpha^{(i)} = a_0 \ldots a_i$
for the finite prefix of $\alpha$ of length $|\alpha^{(i)}| = i + 1$.

Definition 1: A weighted deterministic transition system
(TS) is a tuple $\mathcal{T} = (S, T, AP, L, w)$, where $S$ is a
non-empty finite set of states, $T \subseteq S \times S$ is a transition relation,
$AP$ is a finite set of atomic propositions, $L: S \to 2^{AP}$ is
a labeling function and $w: T \to \mathbb{N}$ is a weight function.
We assume that for every $s \in S$ there exists $s' \in S$ such that
$(s, s') \in T$. An initialized transition system is a TS $\mathcal{T} =
(S, T, AP, L, w)$ with a distinctive initial state $s_{init} \in S$.

A run of a TS $\mathcal{T}$ is an infinite sequence $\rho = s_0 s_1 \ldots \in S^\omega$
such that for every $i \geq 0$ it holds $(s_i, s_{i+1}) \in T$. We use
$\inf(\rho)$ to denote the set of all states visited infinitely many
times in the run $\rho$ and $\mathrm{Run}^{\mathcal{T}}(s)$ for the set of all runs of
$\mathcal{T}$ that start in $s \in S$. Let $\mathrm{Run}^{\mathcal{T}} = \bigcup_{s \in S} \mathrm{Run}^{\mathcal{T}}(s)$. A finite
run $\sigma = s_0 \ldots s_n$ of $\mathcal{T}$ is a finite prefix of a run of $\mathcal{T}$ and
$\mathrm{Run}^{\mathcal{T}}_{fin}(s)$ denotes the set of all finite runs of $\mathcal{T}$ that start
in $s \in S$. Let $\mathrm{Run}^{\mathcal{T}}_{fin} = \bigcup_{s \in S} \mathrm{Run}^{\mathcal{T}}_{fin}(s)$. The length $|\sigma|$, or
number of stages, of a finite run $\sigma = s_0 \ldots s_n$ is $n + 1$ and
$\mathrm{last}(\sigma) = s_n$ denotes the last state of $\sigma$. With slight abuse
of notation, we use $w(\sigma)$ to denote the weight of a finite run
$\sigma = s_0 \ldots s_n$, i.e., $w(\sigma) = \sum_{i=0}^{n-1} w((s_i, s_{i+1}))$. Moreover,
$w^*(s, s')$ denotes the minimum weight of a finite run from $s$
to $s'$. Specifically, $w^*(s, s) = 0$ for every $s \in S$ and if there
does not exist a run from $s$ to $s'$, then $w^*(s, s') = \infty$. For a
set $S' \subseteq S$ we let $w^*(s, S') = \min_{s' \in S'} w^*(s, s')$. We say that a
state $s'$ and a set $S'$ is reachable from $s$, iff $w^*(s, s') \neq \infty$
and $w^*(s, S') \neq \infty$, respectively.
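The minimum weights $w^*$ and the reachability test defined above can be illustrated with a short sketch. It is not taken from the paper: the function names, the dictionary encoding of the weight function, and the four-state example are assumptions made for illustration, and Floyd-Warshall is used simply because the complexity discussion later in the paper names it.

```python
import math

def min_weights(states, w):
    """Floyd-Warshall sketch: w_star[s][t] is the minimum weight of a
    finite run from s to t, or math.inf if t is not reachable from s."""
    w_star = {s: {t: (0 if s == t else math.inf) for t in states}
              for s in states}
    for (s, t), weight in w.items():
        w_star[s][t] = min(w_star[s][t], weight)
    for k in states:
        for s in states:
            for t in states:
                via_k = w_star[s][k] + w_star[k][t]
                if via_k < w_star[s][t]:
                    w_star[s][t] = via_k
    return w_star

def min_weight_to_set(w_star, s, targets):
    """w*(s, S') = minimum over s' in S' of w*(s, s')."""
    return min((w_star[s][t] for t in targets), default=math.inf)

def reachable(w_star, s, t):
    """A state t is reachable from s iff w*(s, t) is finite."""
    return w_star[s][t] != math.inf

# Hypothetical four-state TS; every state has an outgoing transition.
states = ["a", "b", "c", "d"]
w = {("a", "b"): 2, ("b", "c"): 1, ("a", "c"): 5,
     ("c", "a"): 1, ("d", "a"): 3}
w_star = min_weights(states, w)
```

In this example the direct transition from a to c has weight 5, but the run a b c has weight 3, so $w^*(a, c) = 3$; state d has no incoming transition and is therefore not reachable from a.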

Every run $\rho = s_0 s_1 \ldots \in \mathrm{Run}^{\mathcal{T}}$, resp. $\sigma = s_0 \ldots s_n \in
\mathrm{Run}^{\mathcal{T}}_{fin}$, induces a word $z = L(s_0) L(s_1) \ldots \in (2^{AP})^\omega$, resp.
$z = L(s_0) \ldots L(s_n) \in (2^{AP})^+$, over the power set of $AP$.

A cycle of the TS $\mathcal{T}$ is a finite run $\mathrm{cyc} = c_0 \ldots c_m$ of $\mathcal{T}$
for which it holds that $(c_m, c_0) \in T$.

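Definition 2: A sub-system of a TS $\mathcal{T} = (S, T, AP, L, w)$
is a TS $\mathcal{U} = (S_{\mathcal{U}}, T_{\mathcal{U}}, AP, L|_{\mathcal{U}}, w|_{\mathcal{U}})$, where $S_{\mathcal{U}} \subseteq S$ and
$T_{\mathcal{U}} \subseteq T \cap (S_{\mathcal{U}} \times S_{\mathcal{U}})$. We use $L|_{\mathcal{U}}$ to denote the labeling
function $L$ restricted to the set $S_{\mathcal{U}}$. Similarly, we use $w|_{\mathcal{U}}$
with the obvious meaning. If the context is clear, we use $L$, $w$
instead of $L|_{\mathcal{U}}$, $w|_{\mathcal{U}}$. A sub-system $\mathcal{U}$ of $\mathcal{T}$ is called strongly
connected if for every pair of states $s, s' \in S_{\mathcal{U}}$, there exists a
finite run $\sigma \in \mathrm{Run}^{\mathcal{U}}_{fin}(s)$ such that $\mathrm{last}(\sigma) = s'$. A strongly
connected component (SCC) of $\mathcal{T}$ is a maximal strongly
connected sub-system of $\mathcal{T}$. We use $SCC(\mathcal{T})$ to denote the
set of all strongly connected components of $\mathcal{T}$.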

Strongly connected components of a TS T are pairwise 
disjoint. Hence, the cardinality of the set SCC(T) is bounded 
by the number of states of T and the set can be computed 
using Tarjan's algorithm [15]. 
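As a rough illustration of how the set of strongly connected components can be computed, the following is a minimal recursive version of Tarjan's algorithm. The successor-dictionary encoding and the example graph are assumptions for illustration; the recursive form is only suitable for small systems because of Python's recursion limit.

```python
def tarjan_scc(states, succ):
    """Return the strongly connected components as a list of frozensets.

    succ maps each state to an iterable of its successor states.
    """
    index, low = {}, {}
    stack, on_stack = [], set()
    sccs = []
    counter = 0

    def visit(v):
        nonlocal counter
        index[v] = low[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for u in succ.get(v, ()):
            if u not in index:
                visit(u)
                low[v] = min(low[v], low[u])
            elif u in on_stack:
                low[v] = min(low[v], index[u])
        if low[v] == index[v]:
            # v is the root of a component: pop the component off the stack.
            component = set()
            while True:
                u = stack.pop()
                on_stack.discard(u)
                component.add(u)
                if u == v:
                    break
            sccs.append(frozenset(component))

    for v in states:
        if v not in index:
            visit(v)
    return sccs

# Hypothetical example: a, b, c form a cycle; d only leads into it.
succ = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a", "d"]}
components = tarjan_scc(["a", "b", "c", "d"], succ)
```

The components are pairwise disjoint and cover all states, which matches the bound on the cardinality of $SCC(\mathcal{T})$ stated above.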

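Definition 3: Let $\mathcal{T} = (S, T, AP, L, w)$ be a TS. A control
strategy for $\mathcal{T}$ is a function $C: \mathrm{Run}^{\mathcal{T}}_{fin} \to S$ such that for
every $\sigma \in \mathrm{Run}^{\mathcal{T}}_{fin}$, it holds that $(\mathrm{last}(\sigma), C(\sigma)) \in T$.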



A strategy $C$ for which $C(\sigma_1) = C(\sigma_2)$, for all finite
runs $\sigma_1, \sigma_2 \in \mathrm{Run}^{\mathcal{T}}_{fin}$ with $\mathrm{last}(\sigma_1) = \mathrm{last}(\sigma_2)$, is called
memoryless. In that case, $C$ is a function $C: S \to S$.

A strategy is called finite-memory if it is defined as a tuple
$C = (M, \mathrm{next}, \Delta, \mathrm{start})$, where $M$ is a finite set of modes,
$\Delta: M \times S \to M$ is a transition function, $\mathrm{next}: M \times S \to S$
selects a state of $\mathcal{T}$ to be visited next, and $\mathrm{start}: S \to M$
selects the starting mode for every $s \in S$.

A run induced by a strategy $C$ for $\mathcal{T}$ is a run $\rho_C =
s_0 s_1 \ldots \in \mathrm{Run}^{\mathcal{T}}$ for which $s_{i+1} = C(\rho_C^{(i)})$ for every $i \geq 0$.
For every $s \in S$, there is exactly one run induced by $C$ that
starts in $s$. A finite run induced by $C$ is $\sigma_C \in \mathrm{Run}^{\mathcal{T}}_{fin}$, which
is a finite prefix of a run $\rho_C$ induced by $C$.

Let $C$ be a strategy, finite-memory or not, for a TS $\mathcal{T}$.
For every state $s \in S$, the run $\rho_C \in \mathrm{Run}^{\mathcal{T}}(s)$ induced by $C$
satisfies $\inf(\rho_C) \subseteq S_{\mathcal{U}}$ for some $\mathcal{U} \in SCC(\mathcal{T})$ [1]. We say
that $C$ leads $\mathcal{T}$ from the state $s$ to the SCC $\mathcal{U}$.

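Definition 4: Linear Temporal Logic (LTL) formulas over
the set $AP$ are formed according to the following grammar:

$$\phi ::= \mathrm{true} \mid a \mid \neg\phi \mid \phi \vee \phi \mid \phi \wedge \phi \mid \mathbf{X}\,\phi \mid \phi\,\mathbf{U}\,\phi \mid \mathbf{G}\,\phi \mid \mathbf{F}\,\phi,$$

where $a \in AP$ is an atomic proposition, $\neg$, $\vee$ and $\wedge$ are
standard Boolean connectives, and $\mathbf{X}$ (next), $\mathbf{U}$ (until), $\mathbf{G}$
(always) and $\mathbf{F}$ (eventually) are temporal operators.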

The semantics of LTL is defined over words over $2^{AP}$,
such as those generated by the runs of a TS $\mathcal{T}$ (for details
see e.g., [1]). For example, a word $w \in (2^{AP})^\omega$ satisfies $\mathbf{G}\,\phi$
and $\mathbf{F}\,\phi$ if $\phi$ holds in $w$ always and eventually, respectively.
If the word induced by a run of $\mathcal{T}$ satisfies a formula $\phi$, we
say that the run satisfies $\phi$. We call $\phi$ satisfiable in $\mathcal{T}$ from
$s \in S$ if there exists a run $\rho \in \mathrm{Run}^{\mathcal{T}}(s)$ that satisfies $\phi$.

Having an initialized TS $\mathcal{T}$ and an LTL formula $\phi$ over
$AP$, the formal synthesis problem aims to find a strategy
$C$ for $\mathcal{T}$ such that the run $\rho_C \in \mathrm{Run}^{\mathcal{T}}(s_{init})$ induced by
$C$ satisfies $\phi$. In that case we also say that the strategy $C$
satisfies $\phi$. The formal synthesis problem can be solved using
principles from model checking methods [1]. Specifically, $\phi$
is translated to a Büchi automaton and the system combining
the Büchi automaton and the TS $\mathcal{T}$ is analyzed.

Definition 5: A Büchi automaton (BA) is a tuple $\mathcal{B} =
(Q, 2^{AP}, \delta, q_0, F)$, where $Q$ is a non-empty finite set of
states, $2^{AP}$ is the alphabet, $\delta \subseteq Q \times 2^{AP} \times Q$ is a transition
relation such that for every $q \in Q$, $a \in 2^{AP}$, there exists
$q' \in Q$ such that $(q, a, q') \in \delta$, $q_0 \in Q$ is the initial state,
and $F \subseteq Q$ is a set of accepting states.

A run $q_0 q_1 \ldots \in Q^\omega$ of $\mathcal{B}$ is an infinite sequence such that
for every $i \geq 0$ there exists $\alpha_i \in 2^{AP}$ with $(q_i, \alpha_i, q_{i+1}) \in \delta$.
The word $\alpha_0 \alpha_1 \ldots \in (2^{AP})^\omega$ is called the word induced by
the run $q_0 q_1 \ldots$. A run $q_0 q_1 \ldots$ of $\mathcal{B}$ is accepting if there
exist infinitely many $i \geq 0$ such that $q_i$ is an accepting state.

For every LTL formula $\phi$ over $AP$, one can construct a
Büchi automaton $\mathcal{B}_\phi$ such that the words induced by its accepting
runs are exactly the words over $2^{AP}$ satisfying $\phi$ [16]. We refer readers to
[17], [18] for algorithms and to online implementations such
as [19], to translate an LTL formula to a BA.

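Definition 6: Let $\mathcal{T} = (S, T, AP, L, w)$ be an initialized
TS and $\mathcal{B} = (Q, 2^{AP}, \delta, q_0, F)$ be a Büchi automaton.
The product $\mathcal{P}$ of $\mathcal{T}$ and $\mathcal{B}$ is a tuple $\mathcal{P} =
(S_{\mathcal{P}}, T_{\mathcal{P}}, s_{\mathcal{P}init}, AP, L_{\mathcal{P}}, F_{\mathcal{P}}, w_{\mathcal{P}})$, where $S_{\mathcal{P}} = S \times Q$,
$T_{\mathcal{P}} \subseteq S_{\mathcal{P}} \times S_{\mathcal{P}}$ is a transition relation such that for every
$(s, q), (s', q') \in S_{\mathcal{P}}$ it holds that $((s, q), (s', q')) \in T_{\mathcal{P}}$ if
and only if $(s, s') \in T$ and $(q, L(s), q') \in \delta$, $s_{\mathcal{P}init} =
(s_{init}, q_0)$ is the initial state, $L_{\mathcal{P}}((s, q)) = L(s)$ is a labeling
function, $F_{\mathcal{P}} = S \times F$ is a set of accepting states, and
$w_{\mathcal{P}}(((s, q), (s', q'))) = w((s, s'))$ is a weight function.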
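The product construction can be sketched in a few lines. The encoding below (dictionaries for transitions and labels, frozensets for letters of $2^{AP}$) and the function name are assumptions for illustration. As a simplification relative to Definition 6, the sketch builds only the states reachable from the initial state rather than all of $S \times Q$.

```python
def build_product(ts_trans, label, s_init, ba_delta, q0, accepting):
    """Reachable part of the product of a TS and a Buchi automaton.

    ts_trans: dict {(s, s'): weight}; label: dict {s: frozenset of props};
    ba_delta: set of triples (q, letter, q') with letter a frozenset.
    """
    init = (s_init, q0)
    states, frontier = {init}, [init]
    trans = {}
    while frontier:
        s, q = frontier.pop()
        for (s1, s2), weight in ts_trans.items():
            if s1 != s:
                continue
            for (q1, letter, q2) in ba_delta:
                # The automaton reads the label L(s) of the state being left.
                if q1 == q and letter == label[s]:
                    trans[((s, q), (s2, q2))] = weight
                    if (s2, q2) not in states:
                        states.add((s2, q2))
                        frontier.append((s2, q2))
    acc = {(s, q) for (s, q) in states if q in accepting}
    return states, trans, init, acc

# Hypothetical two-state TS and a hand-written automaton for "G F a".
ts_trans = {("s0", "s1"): 2, ("s1", "s0"): 3}
label = {"s0": frozenset(), "s1": frozenset({"a"})}
A, EMPTY = frozenset({"a"}), frozenset()
ba_delta = {("q0", EMPTY, "q0"), ("q0", A, "q1"),
            ("q1", EMPTY, "q0"), ("q1", A, "q1")}
states_p, trans_p, init_p, acc_p = build_product(
    ts_trans, label, "s0", ba_delta, "q0", {"q1"})
```

Here the product has three reachable states, and the accepting state (s0, q1) is entered each time the TS leaves the state labeled with a.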

The product $\mathcal{P}$ can be viewed as an initialized TS with a
set of accepting states. Therefore, we adopt the definitions
of a run $\rho$, a finite run $\sigma$, its weight $w_{\mathcal{P}}(\sigma)$, and sets
$\mathrm{Run}^{\mathcal{P}}((s, q))$, $\mathrm{Run}^{\mathcal{P}}$, $\mathrm{Run}^{\mathcal{P}}_{fin}((s, q))$ and $\mathrm{Run}^{\mathcal{P}}_{fin}$ from above.
Similarly, a cycle $\mathrm{cyc}$ of $\mathcal{P}$, a strategy $C_{\mathcal{P}}$ for $\mathcal{P}$ and runs
$\rho_{C_{\mathcal{P}}}$, $\sigma_{C_{\mathcal{P}}}$ induced by $C_{\mathcal{P}}$ are defined in the same way as for
a TS. On the other hand, $\mathcal{P}$ can be viewed as a weighted
BA over the trivial alphabet with a labeling function, which
gives us the definition of an accepting run of $\mathcal{P}$.

Using the projection on the first component, every run
$(s_0, q_0)(s_1, q_1) \ldots$ and finite run $(s_0, q_0) \ldots (s_n, q_n)$ of $\mathcal{P}$
corresponds to a run $s_0 s_1 \ldots$ and a finite run $s_0 \ldots s_n$ of
$\mathcal{T}$, respectively. Vice versa, for every run $s_0 s_1 \ldots$ and finite
run $s_0 \ldots s_n$ of $\mathcal{T}$, there exists a run $(s_0, q_0)(s_1, q_1) \ldots$ and
finite run $(s_0, q_0) \ldots (s_n, q_n)$ of $\mathcal{P}$. Similarly, every strategy for $\mathcal{P}$
projects to a strategy for $\mathcal{T}$ and for every strategy for $\mathcal{T}$ there
exists a strategy for $\mathcal{P}$ that projects to it. The projection of
a finite-memory strategy for $\mathcal{P}$ is also finite-memory.

Since $\mathcal{P}$ can be viewed as a TS, we also adopt the definitions
of a sub-system and a strongly connected component.

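Definition 7: Let $\mathcal{P} = (S_{\mathcal{P}}, T_{\mathcal{P}}, s_{\mathcal{P}init}, AP, L_{\mathcal{P}}, F_{\mathcal{P}}, w_{\mathcal{P}})$
be the product of an initialized TS $\mathcal{T}$ and a BA $\mathcal{B}$. An
accepting strongly connected component (ASCC) of $\mathcal{P}$ is an
SCC $\mathcal{U} = (S_{\mathcal{U}}, T_{\mathcal{U}}, AP, L_{\mathcal{U}}, w_{\mathcal{U}})$ such that the set $S_{\mathcal{U}} \cap F_{\mathcal{P}}$
is non-empty and we refer to it as the set $F_{\mathcal{U}}$ of accepting
states of $\mathcal{U}$. We use $ASCC(\mathcal{P})$ to denote the set of all ASCCs
of $\mathcal{P}$ that are reachable from the initial state $s_{\mathcal{P}init}$.

In our work, we always assume that $ASCC(\mathcal{P})$ is non-empty,
i.e., the given LTL formula is satisfiable in the TS.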

III. Problem Formulation 

Consider an initialized weighted transition system $\mathcal{T} =
(S, T, AP, L, w)$. The weight $w((s, s'))$ of a transition
$(s, s') \in T$ represents the amount of time that the transition
takes and the system starts at time 0. We use $t_n$ to denote the
point in time after the $n$-th transition of a run, i.e., initially
the system is in a state $s_0$ at time $t_0 = 0$ and after a finite
run $\sigma \in \mathrm{Run}^{\mathcal{T}}_{fin}(s_0)$ of length $n + 1$ the time is $t_n = w(\sigma)$.

We assume there is a dynamic penalty associated with
every state $s \in S$. In this paper, we address the following
model of penalties. Nevertheless, as we discuss in Sec. V,
the algorithms presented in the next section provide an optimal
solution for a much broader class of penalty dynamics.
The penalty is a rational number between 0 and 1 that
is increasing every time unit by $\frac{1}{r}$, where $r \in \mathbb{N}$ is a
given rate. Always when the penalty is 1, in the next time
unit the penalty remains 1 or it drops to 0 according to a
given probability distribution. Upon the visit of a state, the
corresponding penalty is incurred. We assume that the visit
of the state does not affect the penalty's value or dynamics.



Formally, the penalties are defined by a rate $r \in \mathbb{N}$ and a
penalty probability function $p: S \to (0, 1]$, where $p(s)$ is the
probability that if the penalty in a state $s$ is 1 then in the next
time unit the penalty remains 1, and $1 - p(s)$ is the probability
of the penalty dropping to 0. The penalties are described
using a function $g: S \times \mathbb{N}_0 \to \{\frac{i}{r} \mid i \in \{0, 1, \ldots, r\}\}$, such
that $g(s, t)$ is the penalty in a state $s \in S$ at time $t \in \mathbb{N}_0$.
For $s \in S$, $g(s, 0)$ is a uniformly distributed random variable
with values in the set $\{\frac{i}{r} \mid i \in \{0, 1, \ldots, r\}\}$ and for $t \geq 1$

$$g(s, t) = \begin{cases} g(s, t-1) + \frac{1}{r} & \text{if } g(s, t-1) < 1, \\ x & \text{otherwise,} \end{cases} \tag{1}$$

where $x$ is a random variable such that $x = 1$ with
probability $p(s)$ and $x = 0$ otherwise. We use

$$g_{exp}(s) = (1 - p(s)) \cdot \frac{1}{2} + p(s) \cdot 1 = \frac{1}{2}(1 + p(s)) \tag{2}$$

to denote the expected value of the penalty in a state $s \in S$.
Please note that $\frac{1}{2} < g_{exp}(s) \leq 1$, for every $s \in S$.
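The penalty model of Eq. (1) and the expected value of Eq. (2) can be written down directly. The sketch below is only an illustration under assumed names (`expected_penalty`, `next_penalty`, `simulate`); it uses exact fractions so that the multiples of $\frac{1}{r}$ are represented without rounding.

```python
import random
from fractions import Fraction

def expected_penalty(p_s):
    """g_exp(s) = (1 - p(s)) * 1/2 + p(s) * 1, as in Eq. (2)."""
    return (1 - p_s) * Fraction(1, 2) + p_s

def next_penalty(g_prev, r, p_s, rng):
    """One time unit of the penalty dynamics of Eq. (1)."""
    if g_prev < 1:
        return g_prev + Fraction(1, r)
    # The penalty is 1: it stays 1 with probability p(s), else drops to 0.
    return Fraction(1) if rng.random() < p_s else Fraction(0)

def simulate(g0, r, p_s, steps, seed=0):
    """Penalty values g(s, 0), ..., g(s, steps) for one state s."""
    rng = random.Random(seed)
    trajectory = [g0]
    for _ in range(steps):
        trajectory.append(next_penalty(trajectory[-1], r, p_s, rng))
    return trajectory
```

For instance, with $p(s) = \frac{1}{2}$ the expected value is $\frac{3}{4}$, and with $p(s) = 1$ a penalty that has reached 1 never drops again, so `simulate(Fraction(0), 4, Fraction(1), 6)` climbs in steps of $\frac{1}{4}$ and then stays at 1.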

In our setting, the penalties are sensed only locally in
the states in close proximity to the current state. To be
specific, we assume a visibility range $v \in \mathbb{N}$ is given.
If the system is in a state $s \in S$ at time $t$, the penalty
$g(s', t)$ of a state $s' \in S$ is observable if and only if
$s' \in \mathrm{Vis}(s) = \{s' \in S \mid w^*(s, s') \leq v\}$. The set $\mathrm{Vis}(s)$
is also called the set of states visible from $s$.
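The set $\mathrm{Vis}(s)$ can be computed with a shortest-path search that is cut off at the visibility range. The following sketch uses Dijkstra's algorithm for this; the adjacency-list encoding and the example weights are assumptions for illustration.

```python
import heapq

def visible_states(s, succ_w, v):
    """Vis(s) = {s' | w*(s, s') <= v}, computed by Dijkstra's algorithm
    that never expands beyond total weight v.

    succ_w maps a state to a list of (successor, weight) pairs.
    """
    dist = {s: 0}
    heap = [(0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for t, weight in succ_w.get(u, []):
            nd = d + weight
            if nd <= v and nd < dist.get(t, float("inf")):
                dist[t] = nd
                heapq.heappush(heap, (nd, t))
    return set(dist)

# Hypothetical weighted TS.
succ_w = {"a": [("b", 2), ("c", 5)], "b": [("c", 1)],
          "c": [("a", 1)], "d": [("a", 3)]}
```

With visibility range 3, state c is visible from a through the run a b c of weight 3 even though the direct transition has weight 5; with range 2 it is not.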

The problem we consider in this paper combines the
formal synthesis problem with long-term optimization of the
expected amount of penalties incurred during the system's
execution. We assume that the specification is given as an
LTL formula $\phi$ of the form

$$\phi = \varphi \wedge \mathbf{G}\,\mathbf{F}\,\pi_{sur}, \tag{3}$$

where $\varphi$ is an LTL formula over $AP$ and $\pi_{sur} \in AP$. This
formula requires that the system satisfies $\varphi$ and surveys the
states satisfying the property $\pi_{sur}$ infinitely often. We say that
every visit of a state from the set $S_{sur} = \{s \in S \mid \pi_{sur} \in
L(s)\}$ completes a surveillance cycle. Specifically, starting
from the initial state, the first visit of $S_{sur}$ after the initial
state completes the first surveillance cycle of a run. Note
that a surveillance cycle is not a cycle in the sense of the
definition of a cycle of a TS in Sec. II. For a finite run $\sigma$ such
that $\mathrm{last}(\sigma) \in S_{sur}$, $\sharp(\sigma)$ denotes the number of complete
surveillance cycles in $\sigma$, otherwise $\sharp(\sigma)$ is the number of
complete surveillance cycles plus one. We define a function
$V_{\mathcal{T},C}: S \to \mathbb{R}^+_0$ such that $V_{\mathcal{T},C}(s)$ is the expected average
cumulative penalty per surveillance cycle (APPC) incurred
under a strategy $C$ for $\mathcal{T}$ starting from a state $s \in S$:

$$V_{\mathcal{T},C}(s) = \limsup_{n \to \infty} E\left(\frac{\sum_{i=0}^{n} g(\rho_C(i), w(\rho_C^{(i)}))}{\sharp(\rho_C^{(n)})}\right), \tag{4}$$

where $\rho_C \in \mathrm{Run}^{\mathcal{T}}(s)$ is the run induced by $C$ starting from
$s$ and $E(\cdot)$ denotes the expected value. In this paper, we
consider the following problem:

Problem 1: Let $\mathcal{T} = (S, T, AP, L, w)$ be an initialized
TS, with penalties defined by a rate $r \in \mathbb{N}$ and penalty
probabilities $p: S \to (0, 1]$. Let $v \in \mathbb{N}$ be a visibility range
and $\phi$ an LTL formula over the set $AP$ of the form in
Eq. (3). Find a strategy $C$ for $\mathcal{T}$ such that $C$ satisfies $\phi$ and
among all strategies satisfying $\phi$, $C$ minimizes the APPC
value $V_{\mathcal{T},C}(s_{init})$ defined in Eq. (4).
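To make the surveillance-cycle count used in the APPC definition concrete, the sketch below counts cycles of a finite run and forms the finite-run quotient of total penalty over number of cycles. The function names and the example run are assumptions for illustration, and the penalties are passed in as already observed values rather than sampled.

```python
def sharp(run, s_sur):
    """Number of surveillance cycles of a finite run: visits of S_sur
    after the initial state, plus one if the run ends mid-cycle."""
    complete = sum(1 for s in run[1:] if s in s_sur)
    return complete if run[-1] in s_sur else complete + 1

def average_penalty_per_cycle(run, penalties, s_sur):
    """Total penalty incurred along the run divided by the number of
    surveillance cycles. penalties[i] is the penalty incurred in run[i].
    Assumes the run has at least one (possibly incomplete) cycle."""
    return sum(penalties) / sharp(run, s_sur)

# Hypothetical run over states a, b, c where only c is labeled pi_sur.
run = ["a", "b", "c", "a", "c"]
penalties = [0.5, 0.25, 1.0, 0.75, 0.5]
```

The example run visits c twice after the initial state and ends in c, so it contains two complete surveillance cycles and the average is 3.0 / 2 = 1.5.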

In the next section, we propose two algorithms solving the 
above problem. The first algorithm operates offline, without 
the deployment of the system, and therefore, without taking 
advantage of the local sensing of penalties. On the other 
hand, the second algorithm computes the strategy in real-time 
by locally improving the offline strategy according to the 
penalties observed from the current state and their simulation 
over the next h time units, where h > 1 is a natural number, 
a user-defined planning horizon. 

IV. Solution 

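The two algorithms work with the product $\mathcal{P} =
(S_{\mathcal{P}}, T_{\mathcal{P}}, s_{\mathcal{P}init}, AP, L_{\mathcal{P}}, F_{\mathcal{P}}, w_{\mathcal{P}})$ of the initialized TS $\mathcal{T}$
and a Büchi automaton $\mathcal{B}_\phi$ for the LTL formula $\phi$. To project
the penalties from $\mathcal{T}$ to $\mathcal{P}$, we define the penalty in a state
$(s, q) \in S_{\mathcal{P}}$ at time $t$ as $g((s, q), t) = g(s, t)$. We also
adopt the visibility range $v$ and the set $\mathrm{Vis}((s, q))$ of all
states visible from $(s, q)$ is defined as for a state of $\mathcal{T}$. The
APPC function $V_{\mathcal{P},C_{\mathcal{P}}}$ of a strategy $C_{\mathcal{P}}$ for $\mathcal{P}$ is then defined
according to Eq. (4). We use the correspondence between the
strategies for $\mathcal{P}$ and $\mathcal{T}$ to find a strategy for $\mathcal{T}$ that solves
Problem 1. Let $C_{\mathcal{P}}$ be a strategy for $\mathcal{P}$ such that the run
induced by $C_{\mathcal{P}}$ visits the set $F_{\mathcal{P}}$ infinitely many times and
at the same time, the APPC value $V_{\mathcal{P},C_{\mathcal{P}}}(s_{\mathcal{P}init})$ is minimal
among all strategies that visit $F_{\mathcal{P}}$ infinitely many times. It
is easy to see that $C_{\mathcal{P}}$ projects to a strategy $C$ for $\mathcal{T}$ that
solves Problem 1 and $V_{\mathcal{T},C}(s_{init}) = V_{\mathcal{P},C_{\mathcal{P}}}(s_{\mathcal{P}init})$.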

The offline algorithm leverages ideas from formal methods.
Using the automata-based approach to model checking,
one can construct a strategy $C^\phi_{\mathcal{P}}$ for $\mathcal{P}$ that visits at least one
of the accepting states infinitely many times. On the other
hand, using graph theory, we can design a strategy $C^V_{\mathcal{P}}$ that
achieves the minimum APPC value among all strategies of
$\mathcal{P}$ that do not cause an immediate, unrepairable violation of
$\phi$, i.e., $\phi$ is satisfiable from every state of the run induced by
$C^V_{\mathcal{P}}$. However, we would like to have a strategy $C_{\mathcal{P}}$ satisfying
both properties at the same time. To achieve that, we employ
a technique from game theory presented in [20]. Intuitively,
we combine the two strategies $C^\phi_{\mathcal{P}}$ and $C^V_{\mathcal{P}}$ to create a new
strategy $C_{\mathcal{P}}$. The strategy $C_{\mathcal{P}}$ is played in rounds, where
each round consists of two phases. In the first phase, we play
the strategy $C^\phi_{\mathcal{P}}$ until an accepting state is reached. We say
that the system is to achieve the mission subgoal. The second
phase applies the strategy $C^V_{\mathcal{P}}$. The aim is to maintain the
expected average cumulative penalty per surveillance cycle
in the current round, and we refer to it as the average subgoal.
The number of steps for which we apply $C^V_{\mathcal{P}}$ is computed
individually every time we enter the second phase of a round.

The online algorithm constructs a strategy $\overline{C}_{\mathcal{P}}$ by locally
improving the strategy $C_{\mathcal{P}}$ computed by the offline
algorithm. Intuitively, we compare applying $C_{\mathcal{P}}$ for several steps
to reach a specific state or set of states of $\mathcal{P}$, to executing
different local paths to reach the same state or set. We
consider a finite set of finite runs leading to the state, or
set, containing the finite run induced by $C_{\mathcal{P}}$, choose the one
that is expected to minimize the average cumulative penalty
per surveillance cycle incurred in the current round and apply
the first transition of the chosen run. The process continues
until the state, or set, is reached, and then it starts over again.

A. Probability measure 

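Let $C_{\mathcal{P}}$ be a strategy for $\mathcal{P}$ and $(s, q) \in S_{\mathcal{P}}$ a state of
$\mathcal{P}$. For a finite run $\sigma_{C_{\mathcal{P}}} \in \mathrm{Run}^{\mathcal{P}}_{fin}((s, q))$ induced by the
strategy $C_{\mathcal{P}}$ starting from the state $(s, q)$ and a sequence
$\tau \in (\{\frac{i}{r} \mid 0 \leq i \leq r\})^+$ of length $|\sigma_{C_{\mathcal{P}}}|$, we call $(\sigma_{C_{\mathcal{P}}}, \tau)$
a finite pair. Analogously, an infinite pair $(\rho_{C_{\mathcal{P}}}, \kappa)$ consists
of the run $\rho_{C_{\mathcal{P}}} \in \mathrm{Run}^{\mathcal{P}}((s, q))$ induced by the strategy $C_{\mathcal{P}}$
and an infinite sequence $\kappa \in (\{\frac{i}{r} \mid 0 \leq i \leq r\})^\omega$. A cylinder
set $\mathrm{Cyl}((\sigma_{C_{\mathcal{P}}}, \tau))$ of a finite pair $(\sigma_{C_{\mathcal{P}}}, \tau)$ is the set of all
infinite pairs $(\rho_{C_{\mathcal{P}}}, \kappa)$ such that $\tau$ is a prefix of $\kappa$.

Consider the $\sigma$-algebra generated by the set of cylinder
sets of all finite pairs $(\sigma_{C_{\mathcal{P}}}, \tau)$, where $\sigma_{C_{\mathcal{P}}} \in \mathrm{Run}^{\mathcal{P}}_{fin}((s, q))$
is a finite run induced by the strategy $C_{\mathcal{P}}$ starting from the
state $(s, q)$ and $\tau \in (\{\frac{i}{r} \mid 0 \leq i \leq r\})^+$ is of length
$|\sigma_{C_{\mathcal{P}}}|$. From classical concepts in probability theory [21],
there exists a unique probability measure $\Pr^{C_{\mathcal{P}}}_{(s,q)}$ on the
$\sigma$-algebra such that for a finite pair $(\sigma_{C_{\mathcal{P}}}, \tau)$

$$\Pr\nolimits^{C_{\mathcal{P}}}_{(s,q)}(\mathrm{Cyl}((\sigma_{C_{\mathcal{P}}}, \tau)))$$

is the probability that the penalties incurred in the first
$|\sigma_{C_{\mathcal{P}}}|$ stages when applying the strategy $C_{\mathcal{P}}$ in $\mathcal{P}$ from
the state $(s, q)$ are given by the sequence $\tau$, i.e.,

$$g(\sigma_{C_{\mathcal{P}}}(i), w_{\mathcal{P}}(\sigma_{C_{\mathcal{P}}}^{(i)})) = \tau(i)$$

for every $0 \leq i < |\sigma_{C_{\mathcal{P}}}|$. This probability is given by
the penalty dynamics and therefore, can be computed from
the rate $r$ and the penalty probability function $p$. For a
set $X$ of infinite pairs, an element of the above $\sigma$-algebra,
the probability $\Pr^{C_{\mathcal{P}}}_{(s,q)}(X)$ is the probability that under $C_{\mathcal{P}}$
starting from $(s, q)$ the infinite sequence of penalties received
in the visited states is $\kappa$ where $(\rho_{C_{\mathcal{P}}}, \kappa) \in X$.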

B. Offline control 

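In this section, we construct a strategy $C_{\mathcal{P}}$ for $\mathcal{P}$ that
projects to a strategy $C$ for $\mathcal{T}$ solving Problem 1. The
strategy $C_{\mathcal{P}}$ has to visit $F_{\mathcal{P}}$ infinitely many times and
therefore, $C_{\mathcal{P}}$ must lead from $s_{\mathcal{P}init}$ to an ASCC. For
a $\mathcal{U} \in ASCC(\mathcal{P})$, we denote by $V^*_{\mathcal{U}}((s, q))$ the minimum
expected average cumulative penalty per surveillance cycle
that can be achieved in $\mathcal{U}$ starting from $(s, q) \in S_{\mathcal{U}}$. Since
$\mathcal{U}$ is strongly connected, this value is the same for all the
states in $S_{\mathcal{U}}$ and is referred to as $V^*_{\mathcal{U}}$. It is associated with a
cycle $\mathrm{cyc}^V_{\mathcal{U}} = c_0 \ldots c_m$ of $\mathcal{U}$ witnessing the value, i.e.,

$$\frac{1}{|\mathrm{cyc}^V_{\mathcal{U}} \cap S_{\mathcal{U}_{sur}}|} \sum_{i=0}^{m} g_{exp}(c_i) = V^*_{\mathcal{U}},$$

where $S_{\mathcal{U}_{sur}}$ is the set of all states of $\mathcal{U}$ labeled with $\pi_{sur}$.
Since $\mathcal{U}$ is an ASCC, it holds $S_{\mathcal{U}_{sur}} \neq \emptyset$.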

We design an algorithm that finds the value $V^*_{\mathcal{U}}$ and a
cycle $\mathrm{cyc}^V_{\mathcal{U}}$ for an ASCC $\mathcal{U}$. The algorithm first reduces $\mathcal{U}$
to a TS $\overline{\mathcal{U}}$ and then applies Karp's algorithm [22] on $\overline{\mathcal{U}}$.
For a directed graph with values on edges, Karp's algorithm
finds a cycle with minimum value per edge, also called
the minimum mean cycle. The value $V^*_{\mathcal{U}}$ and cycle $\mathrm{cyc}^V_{\mathcal{U}}$ are
synthesized from the minimum mean cycle.

Fig. 1: An example of elimination of a state during the reduction of an
ASCC $\mathcal{U}$. The finite run runs is equal to the one of the finite runs run1
and run2.run4 that minimizes the sum of expected penalties in the states
of the run. Similarly, rung is one of the finite runs run7 and runs. rung.

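Let $\mathcal{U} = (S_{\mathcal{U}}, T_{\mathcal{U}}, AP, L_{\mathcal{U}}, w_{\mathcal{U}})$ be an ASCC of $\mathcal{P}$. For
simplicity, we use singletons such as $u$, $u_i$ to denote the states
of $\mathcal{P}$ in this paragraph. We construct a TS

$$\overline{\mathcal{U}} = (S_{\mathcal{U}_{sur}}, T_{\overline{\mathcal{U}}}, AP, L_{\mathcal{U}}, w_{\overline{\mathcal{U}}})$$

and a function $\mathrm{run}: T_{\overline{\mathcal{U}}} \to \mathrm{Run}^{\mathcal{U}}_{fin}$ for which it holds that
$(u, u') \in T_{\overline{\mathcal{U}}}$ if and only if there exists a finite run in
$\mathcal{U}$ from $u \in S_{\mathcal{U}_{sur}}$ to $u' \in S_{\mathcal{U}_{sur}}$ with one surveillance
cycle, i.e., between $u$ and $u'$ no state labeled with $\pi_{sur}$ is
visited. Moreover, the run $\mathrm{run}((u, u')) = u_0 \ldots u_n$ is such
that $u = u_0$ and $\sigma = u_0 \ldots u_n u'$ is the finite run in $\mathcal{U}$
from $u$ to $u'$ with one surveillance cycle that minimizes
the expected sum of penalties received during $\sigma$ among all
finite runs in $\mathcal{U}$ from $u$ to $u'$ with one surveillance cycle.
The TS $\overline{\mathcal{U}}$ can be constructed from $\mathcal{U}$ by an iterative algorithm
eliminating the states from $S_{\mathcal{U}} \setminus S_{\mathcal{U}_{sur}}$ one by one, in arbitrary
order. At the beginning let $\overline{\mathcal{U}} = \mathcal{U}$, $T_{\overline{\mathcal{U}}} = T_{\mathcal{U}}$, and for every
transition $(u, u') \in T_{\mathcal{U}}$ let $\mathrm{run}((u, u')) = u$. The procedure
for eliminating $u \in S_{\mathcal{U}} \setminus S_{\mathcal{U}_{sur}}$ proceeds as follows. Consider
every $u_1 \neq u$, $u_2 \neq u$ such that $(u_1, u), (u, u_2) \in T_{\overline{\mathcal{U}}}$.
If the transition $(u_1, u_2)$ is not in $T_{\overline{\mathcal{U}}}$, add $(u_1, u_2)$ to
$T_{\overline{\mathcal{U}}}$ and define $\mathrm{run}((u_1, u_2)) = \mathrm{run}((u_1, u)).\mathrm{run}((u, u_2))$,
where $.$ is the concatenation of sequences. If $T_{\overline{\mathcal{U}}}$ already
contains the transition $(u_1, u_2)$ and $\mathrm{run}((u_1, u_2)) = \sigma$, we
set $\mathrm{run}((u_1, u_2)) = \mathrm{run}((u_1, u)).\mathrm{run}((u, u_2))$, if

$$\sum g_{exp}(\mathrm{run}((u_1, u)).\mathrm{run}((u, u_2))) < \sum g_{exp}(\sigma),$$

where $\sum g_{exp}(x)$ for a run $x$ is the sum of $g_{exp}(x(i))$ for
every state $x(i)$ of $x$, otherwise we let $\mathrm{run}((u_1, u_2)) = \sigma$.
The weight $w_{\overline{\mathcal{U}}}((u_1, u_2)) = \sum g_{exp}(\mathrm{run}((u_1, u_2)))$. Once all
pairs $u_1$, $u_2$ are handled, remove $u$ from $S_{\overline{\mathcal{U}}}$ and all adjacent
transitions from $T_{\overline{\mathcal{U}}}$. Fig. 1 demonstrates one iteration of the
algorithm.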
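The elimination procedure can be sketched as follows. This is an illustrative reading of the text above, not the authors' implementation: states are plain strings, runs are tuples of states that exclude the endpoint (matching $\mathrm{run}((u, u')) = u_0 \ldots u_n$), and the example values of $g_{exp}$ are made up.

```python
def reduce_ascc(states, trans, s_sur, g_exp):
    """Eliminate all states outside s_sur one by one.

    trans: iterable of transitions (u, v); g_exp: dict state -> expected
    penalty. Returns (run, weights) for the reduced system, where
    run[(u1, u2)] is the chosen finite run without its endpoint u2.
    """
    run = {(u, v): (u,) for (u, v) in trans}

    def cost(path):
        return sum(g_exp[x] for x in path)

    for u in states:
        if u in s_sur:
            continue
        preds = [a for (a, b) in run if b == u and a != u]
        succs = [b for (a, b) in run if a == u and b != u]
        for u1 in preds:
            for u2 in succs:
                cand = run[(u1, u)] + run[(u, u2)]
                # Keep the run with the smaller expected sum of penalties.
                if (u1, u2) not in run or cost(cand) < cost(run[(u1, u2)]):
                    run[(u1, u2)] = cand
        # Remove u together with all adjacent transitions.
        run = {e: r for e, r in run.items() if u not in e}
    weights = {e: cost(r) for e, r in run.items()}
    return run, weights

# Hypothetical ASCC: u1, u2 are labeled pi_sur; x and y are not.
states = ["u1", "x", "y", "u2"]
trans = [("u1", "x"), ("x", "u2"), ("u1", "y"), ("y", "u2"), ("u2", "u1")]
g_exp = {"u1": 0.75, "u2": 0.75, "x": 1.0, "y": 0.625}
run, weights = reduce_ascc(states, trans, {"u1", "u2"}, g_exp)
```

In this example both x and y connect u1 to u2; the path through y has the lower expected penalty sum (0.75 + 0.625), so it is the one recorded for the reduced transition from u1 to u2.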

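Consequently, we apply Karp's algorithm on the
oriented graph with vertices $S_{\mathcal{U}_{sur}}$, edges $T_{\overline{\mathcal{U}}}$ and values on
edges $w_{\overline{\mathcal{U}}}$. Let $\mathrm{cyc}_{\overline{\mathcal{U}}} = u_0 \ldots u_m$ be the minimum mean
cycle of this graph. Then it holds

$$V^*_{\mathcal{U}} = \frac{1}{m+1} \sum_{i=0}^{m} \sum g_{exp}(\mathrm{run}((u_i, u_{i+1 \bmod (m+1)}))),$$

$$\mathrm{cyc}^V_{\mathcal{U}} = \mathrm{run}((u_0, u_1)). \ldots .\mathrm{run}((u_{m-1}, u_m)).\mathrm{run}((u_m, u_0)).$$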
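For reference, the minimum cycle mean itself can be computed with the standard dynamic-programming formulation of Karp's algorithm. The sketch below returns only the value, not the witnessing cycle, and it assumes the graph is strongly connected so that every vertex is reachable from the chosen source; the names and the three-vertex example are illustrative.

```python
import math

def karp_min_cycle_mean(vertices, weights):
    """Minimum mean weight of a cycle in a strongly connected graph.

    vertices: list of vertices; weights: dict {(u, v): edge value}.
    D[k][v] is the minimum weight of a walk with exactly k edges from
    the source vertices[0] to v (math.inf if there is none).
    """
    n = len(vertices)
    src = vertices[0]
    D = [{v: math.inf for v in vertices} for _ in range(n + 1)]
    D[0][src] = 0.0
    for k in range(1, n + 1):
        for (u, v), w in weights.items():
            if D[k - 1][u] + w < D[k][v]:
                D[k][v] = D[k - 1][u] + w
    best = math.inf
    for v in vertices:
        if D[n][v] == math.inf:
            continue
        worst = max((D[n][v] - D[k][v]) / (n - k)
                    for k in range(n) if D[k][v] < math.inf)
        best = min(best, worst)
    return best

# Hypothetical reduced graph: cycle u0-u1 has mean 1.5, cycle u1-u2 has 0.5.
edge_values = {("u0", "u1"): 1.0, ("u1", "u0"): 2.0,
               ("u1", "u2"): 0.5, ("u2", "u1"): 0.5}
best_mean = karp_min_cycle_mean(["u0", "u1", "u2"], edge_values)
```

Because every edge of the reduced system corresponds to exactly one surveillance cycle, the minimum mean value per edge is the quantity that the formula for $V^*_{\mathcal{U}}$ above expresses.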

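When the APPC value and the corresponding cycle are
computed for every ASCC of $\mathcal{P}$, we choose the ASCC
that minimizes the APPC value. We denote this ASCC
$\mathcal{U} = (S_{\mathcal{U}}, T_{\mathcal{U}}, AP, L_{\mathcal{U}}, w_{\mathcal{U}})$ and $\mathrm{cyc}^V_{\mathcal{U}} = c_0 \ldots c_m$.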



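The mission subgoal aims to reach an accepting state from
the set $F_{\mathcal{U}}$. The corresponding strategy $C^\phi_{\mathcal{P}}$ is such that from
every state $(s, q) \in S_{\mathcal{P}} \setminus F_{\mathcal{U}}$ that can reach the set $F_{\mathcal{U}}$, we
follow one of the finite runs with minimum weight from
$(s, q)$ to $F_{\mathcal{U}}$. That means, $C^\phi_{\mathcal{P}}$ is a memoryless strategy such
that for $(s, q) \in S_{\mathcal{P}} \setminus F_{\mathcal{U}}$ with $w^*_{\mathcal{P}}((s, q), F_{\mathcal{U}}) < \infty$ it holds
$C^\phi_{\mathcal{P}}((s, q)) = (s', q')$ where

$$w_{\mathcal{P}}((s, q), (s', q')) = w^*_{\mathcal{P}}((s, q), F_{\mathcal{U}}) - w^*_{\mathcal{P}}((s', q'), F_{\mathcal{U}}).$$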
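One way to obtain such a memoryless strategy is to compute the minimum weight to the target set by a backward shortest-path search and then pick, in each state, a successor satisfying the equation above. The sketch below does this with Dijkstra's algorithm on the reversed graph; the encoding and the example are assumptions for illustration.

```python
import heapq
import math

def mission_strategy(trans_w, targets):
    """Memoryless strategy following minimum-weight runs to `targets`.

    trans_w: dict {(u, v): weight}. Returns (dist, strategy) where
    dist[u] = w*(u, targets) for states that can reach the targets and
    strategy[u] is a successor v with w(u, v) = dist[u] - dist[v].
    """
    pred = {}
    for (u, v), w in trans_w.items():
        pred.setdefault(v, []).append((u, w))
    dist = {t: 0 for t in targets}
    heap = [(0, t) for t in targets]
    heapq.heapify(heap)
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist[v]:
            continue  # stale heap entry
        for u, w in pred.get(v, []):
            if d + w < dist.get(u, math.inf):
                dist[u] = d + w
                heapq.heappush(heap, (d + w, u))
    strategy = {}
    for (u, v), w in trans_w.items():
        if u in targets or u not in dist or v not in dist:
            continue
        if w == dist[u] - dist[v]:
            strategy.setdefault(u, v)
    return dist, strategy

# Hypothetical product fragment with a single accepting state f.
trans_w = {("a", "b"): 2, ("b", "f"): 1, ("a", "f"): 5,
           ("f", "a"): 1, ("c", "c"): 1}
dist, strategy = mission_strategy(trans_w, {"f"})
```

State c cannot reach f, so the strategy is left undefined there, which mirrors the restriction to states with finite $w^*_{\mathcal{P}}((s, q), F_{\mathcal{U}})$.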

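The strategy $C^V_{\mathcal{P}}$ for the average subgoal is given by
the cycle $\mathrm{cyc}^V_{\mathcal{U}} = c_0 \ldots c_m$ of the ASCC $\mathcal{U}$. Similarly to
the mission subgoal, for a state $(s, q) \in S_{\mathcal{P}} \setminus \mathrm{cyc}^V_{\mathcal{U}}$ with
$w^*_{\mathcal{P}}((s, q), \mathrm{cyc}^V_{\mathcal{U}}) < \infty$, the strategy $C^V_{\mathcal{P}}$ follows one of
the finite runs with minimum weight to $\mathrm{cyc}^V_{\mathcal{U}}$. For a state
$c_i \in \mathrm{cyc}^V_{\mathcal{U}}$, it holds $C^V_{\mathcal{P}}(c_i) = c_{i+1 \bmod (m+1)}$. If all the states
of the cycle $\mathrm{cyc}^V_{\mathcal{U}}$ are distinct, the strategy $C^V_{\mathcal{P}}$ is memoryless,
otherwise it is finite-memory.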

Proposition 1: For the strategy $C^V_{\mathcal{P}}$ and every state
$(s, q) \in S_{\mathcal{U}}$, it holds



$$\lim \Pr\nolimits^{C^V_{\mathcal{P}}}_{(s,q)}\big(\,\ldots \leq \ldots\,\big)$$



Equivalently, for every state $(s, q) \in S_{\mathcal{U}}$ and every $\epsilon > 0$,
there exists $j(\epsilon) \in \mathbb{N}$ such that if the strategy $C^V_{\mathcal{P}}$ is played
from the state $(s, q)$ until at least $l \geq j(\epsilon)$ surveillance
cycles are completed, then the average cumulative penalty
per surveillance cycle incurred in the performed finite run is
at most $V^*_{\mathcal{U}} + \epsilon$ with probability at least $1 - \epsilon$.

Proof: (Sketch.) The proof is based on the fact that the
product $\mathcal{P}$ with dynamic penalties can be translated into a
Markov decision process (MDP) (see e.g., [23]) with static
penalties. The run $\rho_{C^V_{\mathcal{P}}}$ corresponds to a Markov chain (see
e.g., [24]) of the MDP. Moreover, the cycle $\mathrm{cyc}^V_{\mathcal{U}}$ corresponds
to the minimum mean cycle of the reduced TS $\overline{\mathcal{U}}$. Hence,
the equation in the proposition is equivalent to the property
of MDPs with static penalties proved in [20] regarding the
minimum expected penalty incurred per stage. ■

Remark 1: Assume there exists a state $(s, q) \in S_{\mathcal{P}}$ with
$p((s, q)) = 0$, i.e., if the penalty in $(s, q)$ is 1, it always
drops to 0. The dynamics of the penalty in $(s, q)$ is not
probabilistic and if we visit $(s, q)$ infinitely many times, the
expected average penalty incurred in $(s, q)$ might differ from
$g_{exp}((s, q))$. That can cause violation of Prop. 1.

Now we describe the strategy $C_{\mathcal{P}}$. It is played in rounds,
where each round consists of two phases, one for each
subgoal. The first round starts at the beginning of the
execution of the system in the initial state $s_{\mathcal{P}init}$ of $\mathcal{P}$. Let
$i$ be the current round. In the first phase of the round the
strategy $C^\phi_{\mathcal{P}}$ is applied until an accepting state of the ASCC
$\mathcal{U}$ is reached. We use $k_i$ to denote the number of steps we
played the strategy $C^\phi_{\mathcal{P}}$ in round $i$. Once the mission subgoal
is fulfilled, the average subgoal becomes the current subgoal.
In this phase, we play the strategy $C^V_{\mathcal{P}}$ until the number of
completed surveillance cycles in the second phase of the
current round is $l_i \geq \max\{j(\frac{1}{i}), i \cdot k_i\}$.
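The round structure can be summarized by a small simulation skeleton. It is only a sketch under strong simplifying assumptions: both sub-strategies are given as memoryless dictionaries, the function `j` that bounds the number of surveillance cycles is supplied by the caller as a hypothetical stand-in for $j(\epsilon)$ from Prop. 1, and the second phase stops as soon as the smallest admissible $l_i$ is reached.

```python
def play_rounds(start, mission, average, accepting, s_sur, j, num_rounds):
    """Play `num_rounds` rounds of the two-phase strategy and return the
    sequence of visited states.

    mission, average: dicts state -> next state (memoryless sketches of
    the mission and average sub-strategies); j: function epsilon -> int.
    """
    state = start
    trace = [state]
    for i in range(1, num_rounds + 1):
        # Phase 1 (mission subgoal): reach an accepting state.
        k_i = 0
        while state not in accepting:
            state = mission[state]
            trace.append(state)
            k_i += 1
        # Phase 2 (average subgoal): complete l_i surveillance cycles.
        l_i = max(j(1 / i), i * k_i)
        cycles = 0
        while cycles < l_i:
            state = average[state]
            trace.append(state)
            if state in s_sur:
                cycles += 1
    return trace

# Hypothetical three-state example: a is accepting, b is labeled pi_sur.
mission = {"b": "a", "c": "a"}
average = {"a": "b", "b": "c", "c": "b"}
trace = play_rounds("a", mission, average, accepting={"a"},
                    s_sur={"b"}, j=lambda eps: 1, num_rounds=2)
```

In the second round one mission step is needed, so $k_2 = 1$ and the second phase runs for $\max\{1, 2 \cdot 1\} = 2$ surveillance cycles; the growing factor $i$ is what makes the first phase's contribution to the average vanish over time.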

Theorem 1: The strategy $C_{\mathcal{P}}$ projects to a strategy $C$ of
$\mathcal{T}$ that solves Problem 1.



Proof: From the fact that the ASCC $\mathcal{U}$ is reachable from
the initial state $s_{\mathcal{P}init}$ and from the construction of $C^\phi_{\mathcal{P}}$, it
follows that $\mathcal{U}$ is reached from $s_{\mathcal{P}init}$ in finite time. In every
round $i$ of the strategy $C_{\mathcal{P}}$, an accepting state is visited.
Moreover, from Prop. 1 and the fact that $l_i \geq \max\{j(\frac{1}{i}), i \cdot
k_i\}$, it can be shown that the average cumulative penalty per
surveillance cycle incurred in the $i$-th round is at most $V^*_{\mathcal{U}} + \frac{2}{i}$
with probability at least $1 - \frac{1}{i}$. Therefore, in the limit, the run
induced by $C_{\mathcal{P}}$ satisfies the LTL specification and reaches
the optimal average cumulative penalty per surveillance cycle
$V^*_{\mathcal{U}}$ with probability 1. ■

Note that, in general, the strategy $C_{\mathcal{P}}$ is not finite-memory.
The reason is that in the modes of the finite-memory strategy
we would need to store the number of steps spent so far in
the first phase $k_i$ and the number $l_i$ of the surveillance cycles
in the second phase of a given round. Since $l_i$ is generally
increasing with $i$, we would need infinitely many modes to be
able to count the number of surveillance cycles in the second
phase. However, if there exists a cycle $\mathrm{cyc}^V_{\mathcal{U}}$ of the SCC $\mathcal{U}$
corresponding to $V^*_{\mathcal{U}}$ that contains an accepting state, then
the finite-memory strategy $C^V_{\mathcal{P}}$ for the average subgoal maps
to a strategy of $\mathcal{T}$ solving Problem 1, which is therefore in
the worst case finite-memory as well.

Complexity: The size of a BA for an LTL formula $\varphi$ is in the worst case $2^{O(|\varphi|)}$, where $|\varphi|$ is the size of $\varphi$ [17]. However, the actual size of the BA is in practice often quite small. The size of the product $\mathcal{P}$ is $O(|S| \cdot 2^{O(|\varphi|)})$. To compute the minimum weights $w^*_{\mathcal{P}}((s,q),(s',q'))$ between every two states of $\mathcal{P}$ we use the Floyd-Warshall algorithm, taking $O(|S_{\mathcal{P}}|^3)$ time. Tarjan's algorithm [15] is used to compute the set $SCC(\mathcal{P})$ in time $O(|S_{\mathcal{P}}| + |T_{\mathcal{P}}|)$. The reduction of an ASCC $U$ can be computed in time $O(|S_U| \cdot |T_U|^2)$. Karp's algorithm [22] finds the optimal APPC value and the corresponding cycle in time $O(|S_U| \cdot |T_U|)$. The main pitfall of
the algorithm is to compute the number $j(1/i)$ of surveillance cycles needed in the second phase of the current round $i$ according to Prop. 1. Intuitively, we need to consider the finite run $\sigma_{cyc,k}$ induced by the strategy $C^{V}_{\mathcal{P}}$ from the current state that contains $k = 1$ surveillance cycles, and compute the sum of probabilities $\Pr(\mathrm{Cyl}((\sigma_{cyc,k}, \tau)))$ for every $\tau$ with the average cumulative penalty per surveillance cycle less than or equal to $V^*_U + 1/i$. If the total probability is at least $1 - 1/i$, we set $j(1/i) = k$; otherwise we increase $k$ and repeat the process. For every $k$, there exist $r^{|\sigma_{cyc,k}|}$ sequences $\tau$. To partially overcome this issue, we compute the number $j(1/i)$ only at the point in time when the number of surveillance cycles in the second phase of the current round $i$ is $i \cdot k_i$ and the average cumulative penalty in this round is still above $V^*_U + 2/i$. As the simulation results in Sec. VI show, this happens only rarely, if ever.
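The all-pairs minimum-weight computation mentioned in the complexity discussion can be sketched as follows. This is a minimal illustration of the standard Floyd-Warshall recurrence, not the authors' implementation; the names `min_weights` and `min_weight_to_set`, the dictionary-based graph encoding, and the convention that a state has minimum weight 0 to itself are assumptions made for the example.

```python
from math import inf


def min_weights(states, weights):
    """Floyd-Warshall: dist[u][v] is the minimum total weight of a path u -> v.

    `weights` maps a transition (u, v) to its positive weight.
    Assumption of this sketch: dist[u][u] = 0 for every state u.
    """
    dist = {u: {v: (0 if u == v else inf) for v in states} for u in states}
    for (u, v), w in weights.items():
        dist[u][v] = min(dist[u][v], w)
    # Allow each state m in turn as an intermediate state of a path.
    for m in states:
        for u in states:
            for v in states:
                if dist[u][m] + dist[m][v] < dist[u][v]:
                    dist[u][v] = dist[u][m] + dist[m][v]
    return dist


def min_weight_to_set(dist, u, targets):
    """Minimum weight from state u to any state of the set `targets`."""
    return min(dist[u][x] for x in targets)
```

The triple loop gives the cubic running time in the number of product states quoted above; the helper `min_weight_to_set` corresponds to the minimum weight from a state to a set of states that is used by the online algorithm.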



C. Online control 

The online algorithm locally improves the offline strategy $C_{\mathcal{P}}$ according to the values of penalties observed from the current state and their simulation in the next $h$ time units. The resulting strategy $\overline{C}_{\mathcal{P}}$ is again played in rounds. However, in each step of the strategy $\overline{C}_{\mathcal{P}}$, we consider a finite set of finite runs starting from the current state, choose one according to an optimization function, and apply its first transition.

Throughout the rest of the section we use the following notation. We use singletons such as $u$, $u_i$ to denote the states of $\mathcal{P}$. Let $\sigma_{all} \in \mathrm{Run}^{\mathcal{P}}_{fin}(s_{\mathcal{P}init})$ denote the finite run executed by $\mathcal{P}$ so far. Let $i$ be the current round of the strategy $\overline{C}_{\mathcal{P}}$ and $\sigma_i = u_{i,0} \ldots u_{i,k}$ the finite run executed so far in this round, i.e., $u_{i,k}$ is the current state of $\mathcal{P}$. We use $t_{i,0}, \ldots, t_{i,k}$ to denote the points in time when the states $u_{i,0}, \ldots, u_{i,k}$ were visited, respectively.

The optimization function $f\colon \mathrm{Run}^{\mathcal{P}}_{fin}(u_{i,k}) \to [0,1]$ assigns every finite run $\sigma = u_0 \ldots u_n$ starting from the current state a value $f(\sigma)$ that is the expected average cumulative penalty per surveillance cycle that would be incurred in the round $i$, if the run $\sigma$ was to be executed next, i.e.,

$$f(\sigma) = \frac{\sum_{j=0}^{k} g(u_{i,j}, t_{i,j}) + \sum_{j=1}^{n} g_{sim}(u_j,\ t_{i,k} + w_{\mathcal{P}}(\sigma^{(j)}))}{\sharp(\sigma_i . \sigma(1) \ldots last(\sigma))}, \qquad (5)$$

where $g_{sim}(u_j, t_{i,k} + w_{\mathcal{P}}(\sigma^{(j)}))$ is the simulated expected penalty incurred in the state $u_j$ at the time of its visit. If the visit occurs within the next $h$ time units and the state $u_j$ is visible from the current state $u_{i,k}$, we simulate the penalty currently observed in $u_j$ over $w_{\mathcal{P}}(\sigma^{(j)})$ time units. Otherwise, we set the expected penalty to be $g_{exp}(u_j)$. The exact definition of $g_{sim}(u_j, t_{i,k} + w_{\mathcal{P}}(\sigma^{(j)}))$ can be found in Tab. I.

For a set of states $X \subseteq S_{\mathcal{P}}$, we define a shortening indicator function $I_X\colon T_{\mathcal{P}} \to \{0,1\}$ such that for $((s_1,q_1),(s_2,q_2)) \in T_{\mathcal{P}}$

$$I_X(((s_1,q_1),(s_2,q_2))) = \begin{cases} 1 & \text{if } w^*_{\mathcal{P}}((s_1,q_1),X) > w^*_{\mathcal{P}}((s_2,q_2),X), \\ 0 & \text{otherwise.} \end{cases} \qquad (6)$$

Intuitively, the indicator has value 1 if the transition leads strictly closer to the set $X$, and 0 otherwise.
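The shortening indicator can be illustrated with a short sketch. It assumes that the minimum weights from every state to the set $X$ have already been computed and stored in a dictionary; the name `shortening_indicator` and this encoding are hypothetical and only serve the example.

```python
def shortening_indicator(dist_to_x, transition):
    """Return 1 if the transition leads strictly closer to the set X, else 0.

    `dist_to_x[u]` is assumed to hold the minimum weight from state u to X.
    `transition` is a pair (u, v) of product states.
    """
    u, v = transition
    return 1 if dist_to_x[u] > dist_to_x[v] else 0


# Example: X = {"x"}; moving from "far" to "near" shortens the distance,
# moving back does not, and neither does a transition between equidistant states.
dist_to_x = {"far": 5, "near": 2, "side": 2, "x": 0}
values = [
    shortening_indicator(dist_to_x, ("far", "near")),
    shortening_indicator(dist_to_x, ("near", "far")),
    shortening_indicator(dist_to_x, ("near", "side")),
]
```

Because the comparison is strict, a run consisting only of shortening transitions cannot revisit a state, so it reaches $X$ after finitely many steps.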

In the first phase of every round, we locally improve the strategy $C^{\varphi}_{\mathcal{P}}$ computed in Sec. IV-B that aims to visit an accepting state of the chosen ASCC $U$. In each step of the resulting strategy $\overline{C}^{\varphi}_{\mathcal{P}}$, we consider the set $\mathrm{Run}^{\varphi}(u_{i,k})$ of all finite runs from the current state $u_{i,k}$ that lead to an accepting state from the set $F_U$ with all transitions shortening in the indicator $I_{F_U}$ defined according to Eq. (6), i.e.,

$$\mathrm{Run}^{\varphi}(u_{i,k}) = \{\sigma \in \mathrm{Run}^{\mathcal{P}}_{fin}(u_{i,k}) \mid last(\sigma) \in F_U,\ \forall\, 0 \leq j < |\sigma| - 1\colon I_{F_U}((\sigma(j), \sigma(j+1))) = 1\}.$$

Let $\sigma \in \mathrm{Run}^{\varphi}(u_{i,k})$ be the run that minimizes the optimization function $f$ from Eq. (5). Then $\overline{C}^{\varphi}_{\mathcal{P}}(\sigma_{all}) = \sigma(1)$. Just like in the offline algorithm, the strategy $\overline{C}^{\varphi}_{\mathcal{P}}$ is applied until a state from the set $F_U$ is visited.
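One step of the first phase described above can be sketched as follows: enumerate all finite runs from the current state to an accepting state whose transitions are all shortening, pick the run with the smallest value of a cost function, and apply its first transition. The function names, the adjacency-dictionary encoding, and the `cost` callable (standing in for the optimization function $f$) are assumptions of this sketch rather than the authors' code.

```python
def shortening_runs(current, successors, dist_to_f, accepting):
    """Enumerate finite runs from `current` that end in `accepting` and use
    only transitions that strictly decrease the minimum weight to `accepting`."""
    runs = []

    def extend(run):
        last = run[-1]
        if len(run) > 1 and last in accepting:
            runs.append(list(run))
        for nxt in successors.get(last, ()):
            # Shortening transition: strictly closer to the accepting set.
            if dist_to_f[last] > dist_to_f[nxt]:
                run.append(nxt)
                extend(run)
                run.pop()

    extend([current])
    return runs


def next_state(current, successors, dist_to_f, accepting, cost):
    """Receding-horizon step: return the first successor on the cheapest run,
    or None if no admissible run exists."""
    runs = shortening_runs(current, successors, dist_to_f, accepting)
    if not runs:
        return None
    best = min(runs, key=cost)
    return best[1]
```

Only the first transition of the selected run is applied; the enumeration is repeated from the new current state with freshly observed penalties, which is what makes the scheme a receding horizon one.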

In the second phase, we locally improve the strategy $C^{V}_{\mathcal{P}}$ for the average subgoal computed in Sec. IV-B to obtain a strategy $\overline{C}^{V}_{\mathcal{P}}$. However, the definition of the set of finite runs we choose from changes during the phase. At the beginning of the second phase of the current round $i$, we aim to reach the cycle $cyc^V_U = c_0 \ldots c_m$ of the ASCC $U$ and we use the same idea that is used in the first phase above. To be specific, we define $\overline{C}^{V}_{\mathcal{P}}(\sigma_{all}) = \sigma(1)$, where $\sigma$ is the finite



fl(«i,*i,fc) + Wv( r ' ] » «J 6 Vta(ll ilfc ) I tfl J ,(«Ttf)) < ft and flto.ti.fc) + < 1, 

E pst(f) + pst(l) if g VisK, fc ),^( CT 0)) < feandg(it J -,t i , fc )+ ™^ 0) > > i, 
Soxp ( W j ) otherwise . 

p*(=) = ( E • (i - pCkj))^-^ ■ (pW)^-^^ 1 ") ■ (i - ?(•)) ■ = 

if z = w-p((T^>) — (1 — ,ti,k)) ' r ~ x ~ 1 > 0, 2i = 2 div (r - I), 22 =2 mod (r + 1); otherwise if z < 0, pst(— ) = 

P st(i)= E 

where z = Wp{(r^') — (1 — g{uj, t» jfe)) ■ r — 1, 23 = z div (r + 1), 24 = z mod (r + 1) 

TABLE I: The function computing the simulated expected penalty incurred in a state $u_j$ of the run $\sigma$ at the time of its visit $t_{i,k} + w_{\mathcal{P}}(\sigma^{(j)})$ if we are to apply the run $\sigma$ from the current state $u_{i,k}$; div stands for integer division and mod for modulus.



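run minimizing $f$ from the set

$$\mathrm{Run}^{V}(u_{i,k}) = \{\sigma \in \mathrm{Run}^{\mathcal{P}}_{fin}(u_{i,k}) \mid last(\sigma) \in cyc^V_U,\ \forall\, 0 \leq j < |\sigma| - 1\colon I_{cyc^V_U}((\sigma(j), \sigma(j+1))) = 1\}.$$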

Once a state $c_a \in cyc^V_U$ of the cycle is reached, we continue as follows. Let $c_b \in cyc^V_U$ be the first state labeled with $\pi_{sur}$ that is visited from $c_a$ if we follow the cycle. Until we reach the state $c_b$, the optimal finite run $\sigma$ is chosen from the set

$$\mathrm{Run}^{V}(u_{i,k}) = \{\sigma \in \mathrm{Run}^{\mathcal{P}}_{fin}(u_{i,k}) \mid last(\sigma) = c_b, \text{ and } \forall\, 0 \leq j < |\sigma| - 1\colon I_{c_b}((\sigma(j), \sigma(j+1))) = 1 \text{ or } |\sigma_{c_a \to u_{i,k}}| + |\sigma| \leq b - a + 2 \bmod (m+1)\},$$

where $\sigma_{c_a \to u_{i,k}}$ is the finite run already executed in $\mathcal{P}$ from the state $c_a$ to the current state $u_{i,k}$. Intuitively, the set contains every finite run from the current state to the state $c_b$ that either has all transitions shortening in $I_{c_b}$, or whose length is such that, if we were to perform it, the performed run from $c_a$ to $c_b$ would not be longer than following the cycle from $c_a$ to $c_b$. When the state $c_b$ is reached, we restart the above procedure with $c_a = c_b$. The strategy $\overline{C}^{V}_{\mathcal{P}}$ is performed until $l_i \geq \max\{j(1/i),\ i \cdot k_i\}$ surveillance cycles are completed in the second phase of the current round $i$, where $k_i$ is the number of steps of the first phase and $j$ is from Prop. 1. We can end the second phase sooner, specifically at any time when we complete a surveillance cycle and the average cumulative penalty per surveillance cycle incurred in the current round is less than or equal to $V^*_U + 2/i$.
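The stopping rule for the second phase of a round can be sketched as a small predicate that is evaluated each time a surveillance cycle is completed. The function name, its parameters, and the lazily evaluated callable `j_of` (standing in for the bound $j$ from Prop. 1, which is expensive to compute) are hypothetical; the early-stop bound is passed in as `threshold` rather than hard-coded.

```python
def second_phase_finished(i, k_i, cycles_done, avg_penalty, threshold, j_of):
    """Decide whether the second phase of round i may end.

    i           -- index of the current round
    k_i         -- number of steps spent in the first phase of this round
    cycles_done -- surveillance cycles completed in the second phase so far
    avg_penalty -- average cumulative penalty per surveillance cycle in the round
    threshold   -- early-stop bound on avg_penalty
    j_of        -- callable returning the bound from Prop. 1 for round i;
                   it is only called when it cannot be avoided
    """
    # Early stop: a cycle was just completed and the round average is low enough.
    if cycles_done >= 1 and avg_penalty <= threshold:
        return True
    # cycles_done >= max(j, i * k_i) requires cycles_done >= i * k_i first.
    if cycles_done < i * k_i:
        return False
    return cycles_done >= j_of(i)
```

Checking the cheap conditions first means the costly bound is requested only once $i \cdot k_i$ cycles have been completed and the round average is still above the threshold.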

Theorem 2: The strategy $\overline{C}_{\mathcal{P}}$ projects to a strategy $\overline{C}$ of $\mathcal{T}$ solving Problem 1.

Proof: First, we prove that Prop. 1 holds for the strategy $\overline{C}^{V}_{\mathcal{P}}$ as well. This result follows directly from the facts below. The set of finite runs we choose from always contains a finite run induced by the strategy $C^{V}_{\mathcal{P}}$. Once the cycle $cyc^V_U$ is reached, the system optimizes the finite run from one surveillance state of the cycle to the next, until it is reached after finite time. Finally, if the strategy $\overline{C}^{V}_{\mathcal{P}}$ does not follow $C^{V}_{\mathcal{P}}$, it is only because the chosen finite run provides a lower expected average. The correctness of the strategy $\overline{C}_{\mathcal{P}}$ is now proved analogously to the correctness of the strategy $C_{\mathcal{P}}$ computed offline. ■

Proposition 2: The online strategy $\overline{C}_{\mathcal{P}}$ is with probability 1 expected to perform in the worst case as well as the strategy $C_{\mathcal{P}}$ computed offline. That means, if the average cumulative penalty per surveillance cycle incurred in the run of the system performed so far is lower than the optimal APPC value $V^*_U$, it will rise slower under the strategy $\overline{C}_{\mathcal{P}}$ than under the strategy $C_{\mathcal{P}}$. On the other hand, if the average cumulative penalty per surveillance cycle incurred in the run of the system performed so far is higher than the optimal APPC value $V^*_U$, it is expected to decrease faster under the strategy $\overline{C}_{\mathcal{P}}$ than under the strategy $C_{\mathcal{P}}$.

Proof: Follows from the proof of Theorem 2. ■
Complexity: The cardinality of the set of finite runs $\mathrm{Run}^{\varphi}(u_{i,k})$ grows exponentially with the minimum weight $w^*_{\mathcal{P}}(u_{i,k}, F_U)$. Analogously, the same holds for the set of finite runs $\mathrm{Run}^{V}(u_{i,k})$ and the set $cyc^V_U$ or one of its surveillance states. To simplify the computations and effectively use the algorithm in real time, one can use the following rule that was also applied in our implementation in Sec. VI. We put a threshold on the maximum weight of a finite run in $\mathrm{Run}^{\varphi}(u_{i,k})$ and $\mathrm{Run}^{V}(u_{i,k})$. In the second phase of a round, when on the optimal cycle, we optimize the finite run from the state $c_a$ to the next surveillance state $c_b$ on the cycle. However, if the weight of the fragment of the cycle from $c_a$ to $c_b$ is too high, we can first optimize the run to some intermediate state $c'_b$. Also, the complexity of one step of the strategy $\overline{C}_{\mathcal{P}}$ grows exponentially with the user-defined planning horizon $h$. Hence, $h$ should be chosen with this trade-off in mind: the higher the planning horizon, the better the local improvement, but the more expensive each step.
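The effect of a weight threshold on the set of candidate runs can be illustrated with a bounded enumeration. The sketch below lists every finite run from the current state to a target set whose total weight does not exceed a given bound; it omits the additional shortening conditions of the sets used by the online algorithm, and the name `bounded_runs` and the graph encoding are assumptions of the example. Transition weights are assumed to be positive, which guarantees termination.

```python
def bounded_runs(current, successors, weights, targets, max_weight):
    """Enumerate (run, total_weight) pairs for all finite runs from `current`
    that end in `targets` and have total weight at most `max_weight`."""
    runs = []

    def extend(run, total):
        if len(run) > 1 and run[-1] in targets:
            runs.append((list(run), total))
        for nxt in successors.get(run[-1], ()):
            w = weights[(run[-1], nxt)]
            # Prune every extension that would exceed the weight threshold.
            if total + w <= max_weight:
                run.append(nxt)
                extend(run, total + w)
                run.pop()

    extend([current], 0)
    return runs
```

Lowering `max_weight` shrinks the candidate set and hence the cost of one step, at the price of possibly excluding a run with a lower expected penalty.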

V. Discussion 

Every LTL formula $\phi$ over $AP$ can be converted to a formula $\varphi$ of the form in Eq. for which it holds that a run of the TS $\mathcal{T}$ satisfies $\phi$ if and only if it satisfies $\varphi$. The formula is $\varphi = \phi \wedge GF\pi_{sur}$, where $\pi_{sur} \in L(s)$ for every $s \in S$. In that case, Problem 1 requires minimizing the expected average penalty incurred per stage.

The algorithms presented in Sec. IV can also be used to correctly solve Problem 1 for systems with penalty dynamics different from the one defined in Sec. III. However, for every state we need to be able to compute the expected value of the penalty in the state, like in Eq. For the online algorithm we also require that the dynamics of penalties allow us to simulate them for a finite number of time units. More precisely, if we observe the penalty in a state $s \in S$ at time $t$, we can compute the simulated expected value of the penalty in $s$ in every following time unit, up to $h$ time units, based only on the observed value.







Fig. 2: (a) A TS modeling the robot (black dot) motion in a partitioned 
environment. Two stock locations are in green, a base is shown in blue, 
and unsafe locations are in red. There is a transition between vertically, 
horizontally or diagonally neighboring states. The weight of a horizontal 
and vertical transition is 2, for a diagonal transition it is 3. (b) The penalty 
probabilities in states. Darker shade indicates higher probability. 



The online algorithm from Sec. IV-C is a heuristic. The sets of finite runs $\mathrm{Run}^{\varphi}(u_{i,k})$, $\mathrm{Run}^{V}(u_{i,k})$ can be defined differently according to the properties of the actual problem. To guarantee the correctness of the strategy $\overline{C}_{\mathcal{P}}$, the sets must satisfy the following conditions. There always exists a finite run in the set minimizing the optimization function $f$ in Eq. (5). The definition of the set $\mathrm{Run}^{\varphi}(u_{i,k})$ guarantees that an accepting state from $F_U$ is visited after a finite number of steps. The definition of $\mathrm{Run}^{V}(u_{i,k})$ also guarantees a visit of the cycle $cyc^V_U$ in finite time and, moreover, Prop. 1 holds for the resulting strategy $\overline{C}^{V}_{\mathcal{P}}$.

VI. Case study 

We implemented the framework developed in this paper 
for a persistent surveillance robotics example in Java [25]. 
In this section, we report on the simulation results. 

We consider a mobile robot in a grid-like partitioned environment modeled as a TS depicted in Fig. 2a. The robot transports packages between two stocks, marked green in Fig. 2a. The blue state marks the robot's base location. The penalties in states are defined by rate $r = 5$ and the penalty probability function in Fig. 2b. The visibility range $v$ is 6. For example, in Fig. 2a the set $Vis(s)$ of states visible from the current state $s$, with corresponding penalties, is depicted as the blue-shaded area. We set the planning horizon $h = 9$.

The mission for the robot is to transport packages between the two stocks (labeled with propositions $a$ and $b$, respectively) and infinitely many times return to the base (labeled with proposition $c$). The red states in Fig. 2a are dangerous locations (labeled with $u$) which are to be avoided. At the same time, we wish to minimize the cumulative penalty incurred during the transport of a package, i.e., the surveillance property $\pi_{sur}$ is true in both stock states. The corresponding LTL formula is

$$G(a \Rightarrow X(\neg a \, U \, b)) \wedge G(b \Rightarrow X(\neg b \, U \, a)) \wedge GF c \wedge G(\neg u) \wedge GF \pi_{sur},$$

and the Büchi automaton has 10 states. The cycle providing the minimum expected average cumulative penalty per surveillance cycle is depicted in magenta in Fig. 2a and the optimal APPC value is 5.4.

Fig. 3: (a) The average cumulative penalty per surveillance cycle incurred during the runs, shown at the end of each round. The red line marks the optimal APPC value. (b) The average cumulative penalty per surveillance cycle incurred in every round. The red bars indicate the threshold $V^*_U + 2/i$.

We ran both the offline and the online algorithm for multiple rounds starting from the base state. In Fig. 3 we report on the results for 20 rounds; for more results see [25]. As illustrated in Fig. 3a, the average cumulative penalty per surveillance cycle incurred in the run induced by the offline strategy is above the optimal value and converges to it fairly fast. For the run induced by the online strategy, the average is significantly below the minimum APPC value due to the local improvement based on local sensing. On the other hand, Fig. 3b shows the average cumulative penalty per surveillance cycle incurred in each round separately. The number of surveillance cycles performed in the second phase of every round $i$ of the offline strategy was less than $i \cdot k_i$, i.e., the second phase always ended due to the fact that the average incurred in the round was below the threshold $V^*_U + 2/i$. The maximum number of surveillance cycles performed in the second phase of a round was 7. The same is true for the online strategy, where the maximum number of surveillance cycles in the second phase of a round was 3. For both algorithms, the number of surveillance cycles in the second phase of a round does not evolve monotonically, but rather randomly. Hence we conclude that in every round $i$ we are unlikely to need to compute the value